# This section is to install package that will be required to proceed with this section of the workshop.
# If you do not have "devtools" installed, uncomment the following line and run on your console
# install.packages("devtools")
# devtools::install_github("SydneyBioX/scdney")
You may visit our package website for the vignette of scdney package.
# We will subset Su et al. and Yang et al. datasets.
ids = which(colData(sce_scMerge)$batch %in% c("GSE87795", "GSE90047"))
lab = colData(sce_scMerge)$cellTypes[ids]
nCs = length(table(lab))
mat = SummarizedExperiment::assay(sce_scMerge, "scMerge")[,ids]
par(mar=c(10,7,2,2))
barplot(table(lab), las = 2, ylab = "No. of cells", main = "Cell type proportion")
We observe that hepatoblast/hepatocyte is the largest population.
# create tsne object
set.seed(123)
tsne_result = Rtsne(t(mat), check_duplicates = FALSE)
plot(tsne_result$Y, col = factor(lab) , main = "t-SNE plot (cells coloured by known cell types")
The data from single cell RNA-sequencing experiment does not provide cell type information for individual cells. We need to identify cell types of indivdual cells in our data with bioinformatics analysis.
One common method to explore single cell data is by clustering. Clustering is a type of machine learning technique to group similar samples (cell) to the same group and partition samples that are different by comparing their feature information (gene expression). To optimise clustering method, algorithm and similarity metric are two key components that affects clustering performance.
In our recent study, we found pearson correlation to be optimal similarity metric for comparing single cell RNA-seq data ( Kim et al., 2018 ). Therefore in this workshop, we will utilise scClust in our scdney package, which implemented 2017 Nature methods clustering algorithm SIMLR with pearson correlation.
We need to validate the number of clusters (k) in this dataset. there actually are in an unsupervised approach.
# k = 6
simlr_result_k6 = scClust(mat, 6, similarity = "pearson", method = "simlr", seed = 1, cores.ratio = 0, geneFilter = 0)
## Computing the multiple Kernels.
## Performing network diffiusion.
## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Iteration: 5
## Iteration: 6
## Iteration: 7
## Iteration: 8
## Iteration: 9
## Iteration: 10
## Performing t-SNE.
## Epoch: Iteration # 100 error is: 0.1527458
## Epoch: Iteration # 200 error is: 0.1218641
## Epoch: Iteration # 300 error is: 0.1112158
## Epoch: Iteration # 400 error is: 0.1059176
## Epoch: Iteration # 500 error is: 0.1026344
## Epoch: Iteration # 600 error is: 0.1003277
## Epoch: Iteration # 700 error is: 0.0985936
## Epoch: Iteration # 800 error is: 0.09722343
## Epoch: Iteration # 900 error is: 0.09610318
## Epoch: Iteration # 1000 error is: 0.09516611
## Performing Kmeans.
## Performing t-SNE.
## Epoch: Iteration # 100 error is: 11.56649
## Epoch: Iteration # 200 error is: 0.2845774
## Epoch: Iteration # 300 error is: 0.2560054
## Epoch: Iteration # 400 error is: 0.2410104
## Epoch: Iteration # 500 error is: 0.2361371
## Epoch: Iteration # 600 error is: 0.2331395
## Epoch: Iteration # 700 error is: 0.231009
## Epoch: Iteration # 800 error is: 0.229431
## Epoch: Iteration # 900 error is: 0.2281798
## Epoch: Iteration # 1000 error is: 0.2271669
# We will not run for various `k` to save time. Instead, we will load pre-computed result
# all_k = 3:8
# simlr.results = sapply(as.character(all_k), function(k) {
# scClust(mat, as.numeric(k), similarity = "pearson", method = "simlr", seed = 1, cores.ratio = 0, geneFilter = 0)
# }, USE.NAMES = TRUE, simplify = FALSE)
load(paste(getwd(), "/data/simlr.results.RData", sep = ""))
kall_wss = sapply(simlr_results, function(result) {
sum(result$y$withinss)
}, USE.NAMES = TRUE, simplify = TRUE)
plot(x = names(all_wss), y = all_wss, main = "Total WSS for each k", xlab = "k", ylab = "total WSS")
As shown in this plot, the graph begin to plateau from k = 5. We can estimate that k is around 5 or 6. We can further investigate using silhouette scores or other metrics but using our t-SNE plot (and with a little cheating) we will estimate k = 6.
plot(tsne_result$Y, col = simlr_result_k6$y$cluster, main = "t-SNE plot coloured by cluster output")
mclust::adjustedRandIndex(lab, simlr_result_k6$y$cluster)
## [1] 0.5752176
igraph::compare(as.numeric(factor(simlr_result_k6$y$cluster)), as.numeric(factor(lab)), method = "nmi")
## [1] 0.6873722
par(mfrow=c(1,2))
plot(tsne_result$Y, col = factor(lab), main = "t-SNE coloured by cell type information")
plot(tsne_result$Y, col = simlr_result_k6$y$cluster, main = "t-SNE coloured by our result")
There are 6 cell types in the dataset. We have obtained some marker genes for all the 6 cell types through some literature search. In this section, we will explain how we can visualise the distribution of marker genes.
Firstly, we can colour the t-SNE plot using the provided cell labels. In the figure eblow, Endothelial cells is red and the rest of the cells is black.
We can then colour the t-SNE plot according to expression value of Endothelial cells marker genes.
For example, in the code provided below, we first used which() to find out which row contains Sox17 gene. We then used ifelse() to indicate that, if a cell has Sox17 expression value < 2, then this cell will be coloured red, otherwise it will be coloured green.
The figure below illustrates that most of the cells in the top right cluster is coloured green and almost all the other cells is coloured red. This indicates that cells in the top right cluster are indeed different from the remaining cells in their expression pattern of Sox17. This potentially suggests that these cells belong to endothelial cells.
plot(tsne_result$Y, col = factor(lab == "Endothelial Cell"), main = "Endothelial Cell" )
index <- which(rownames(sce_scMerge[,ids]) == "Sox17")
plot(tsne_result$Y, col = ifelse( mat[2,] < 2 ,'red','green'), main = "Sox17")
index <- which(rownames(sce_scMerge[,ids]) == "Sox18")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 2 ,'red','green'), main = "Sox18")
index <- which(rownames(sce_scMerge[,ids]) == "Sox7")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 2 ,'red','green'), main = "Sox7")
We can repeat the procudure on other cell type of interest.
plot(tsne_result$Y, col = factor(lab == "cholangiocyte") , main = "cholangiocyte")
index <- which(rownames(sce_scMerge[,ids]) == "Sox9")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 6 ,'red','green'), main = "Sox9")
index <- which(rownames(sce_scMerge[,ids]) == "Epcam")
hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 6 ,'red','green'), main = "Epcam")
index <- which(rownames(sce_scMerge[,ids]) == "Krt19")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 2 ,'red','green'), main = "Krt19")
plot(tsne_result$Y, col = factor(lab == "Mesenchymal Cell"), main = "Mesenchymal Cell")
index <- which(rownames(sce_scMerge[,ids]) == "Map2")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] > -1.5 ,'red','green'), main = "Map2")
index <- which(rownames(sce_scMerge[,ids]) == "Gfap")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 0.6 ,'red','green'), main = "Gfap")
plot(tsne_result$Y, col = factor(lab == "hepatoblast/hepatocyte"), main = "hepatoblast/hepatocyte" )
index <- which(rownames(sce_scMerge[,ids]) == "Foxa2")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 2 ,'red','green'), main = "Foxa2")
index <- which(rownames(sce_scMerge[,ids]) == "Hnf4a")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 2 ,'red','green'), main = "Hnf4a")
plot(tsne_result$Y, col = factor(lab == "Hematopoietic"), main = "Hematopoietic" )
index <- which(rownames(sce_scMerge[,ids]) == "Mpl")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] <1.5 ,'red','green'), main = "Mpl")
plot(tsne_result$Y, col = factor(lab == "Immune cell"), main = "Immune cell")
index <- which(rownames(sce_scMerge[,ids]) == "S100a9")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 2 ,'red','green'), main = "S100a9")
index <- which(rownames(sce_scMerge[,ids]) == "Fcer1g")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 5 ,'red','green'), main = "Fcer1g")
index <- which(rownames(sce_scMerge[,ids]) == "Cd69")
#hist(mat[index,])
plot(tsne_result$Y, col = ifelse( mat[index,] < 5 ,'red','green'), main = "Cd69")
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] cluster_2.0.9 Rtsne_0.15
## [3] mclust_5.4.3 scdney_0.1.2
## [5] edgeR_3.26.4 limma_3.40.2
## [7] dplyr_0.8.1 SingleCellExperiment_1.6.0
## [9] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
## [11] BiocParallel_1.18.0 matrixStats_0.54.0
## [13] Biobase_2.44.0 GenomicRanges_1.36.0
## [15] GenomeInfoDb_1.20.0 IRanges_2.18.0
## [17] S4Vectors_0.22.0 BiocGenerics_0.30.0
##
## loaded via a namespace (and not attached):
## [1] amap_0.8-17 minqa_1.2.4 colorspace_1.4-1
## [4] class_7.3-15 ggridges_0.5.1 htmlTable_1.13.1
## [7] XVector_0.24.0 base64enc_0.1-3 rstudioapi_0.10
## [10] mice_3.5.0 prodlim_2018.04.18 manipulate_1.0.1
## [13] mvtnorm_1.0-10 lubridate_1.7.4 codetools_0.2-16
## [16] splines_3.6.0 doParallel_1.0.14 knitr_1.23
## [19] Formula_1.2-3 nloptr_1.2.1 caret_6.0-84
## [22] clusteval_0.1 broom_0.5.2 compiler_3.6.0
## [25] backports_1.1.4 assertthat_0.2.1 Matrix_1.2-17
## [28] lazyeval_0.2.2 acepack_1.4.1 htmltools_0.3.6
## [31] tools_3.6.0 igraph_1.2.4.1 gtable_0.3.0
## [34] glue_1.3.1 GenomeInfoDbData_1.2.1 reshape2_1.4.3
## [37] Rcpp_1.0.1 nlme_3.1-140 iterators_1.0.10
## [40] timeDate_3043.102 xfun_0.7 gower_0.2.1
## [43] stringr_1.4.0 lme4_1.1-21 dendextend_1.12.0
## [46] pan_1.6 zlibbioc_1.30.0 MASS_7.3-51.4
## [49] scales_1.0.0 ipred_0.9-9 doSNOW_1.0.16
## [52] expm_0.999-4 RColorBrewer_1.1-2 yaml_2.2.0
## [55] gridExtra_2.3 ggplot2_3.1.1 rpart_4.1-15
## [58] latticeExtra_0.6-28 stringi_1.4.3 randomForest_4.6-14
## [61] foreach_1.4.4 e1071_1.7-1 checkmate_1.9.3
## [64] boot_1.3-22 lava_1.6.5 rlang_0.3.4
## [67] pkgconfig_2.0.2 bitops_1.0-6 evaluate_0.14
## [70] lattice_0.20-38 purrr_0.3.2 recipes_0.1.5
## [73] htmlwidgets_1.3 tidyselect_0.2.5 plyr_1.8.4
## [76] magrittr_1.5 R6_2.4.0 snow_0.4-3
## [79] DescTools_0.99.28 generics_0.0.2 Hmisc_4.2-0
## [82] mitml_0.3-7 pillar_1.4.1 foreign_0.8-71
## [85] withr_2.1.2 survival_2.44-1.1 RCurl_1.95-4.12
## [88] nnet_7.3-12 tibble_2.1.2 crayon_1.3.4
## [91] jomo_2.6-8 rmarkdown_1.13 viridis_0.5.1
## [94] locfit_1.5-9.1 grid_3.6.0 data.table_1.12.2
## [97] ModelMetrics_1.2.2 digest_0.6.19 tidyr_0.8.3
## [100] munsell_0.5.0 viridisLite_0.3.0